Gleaning the Web
Abstract
which threatens to swamp the Internet's promised productivity gains, educational benefits, and entertainment value. In recent years, computer science has risen to this challenge, with substantial progress on systems for retrieving and filtering text.

Information extraction (IE) systems provide a complementary service. IE is the task of identifying the specific fragments of a single document that constitute its core semantic content. IE from weather reports, for example, might involve identifying locations, dates, and high and low temperatures. IE from apartment listings would find neighborhoods, numbers of bedrooms, rents, and telephone numbers.

Scalability is the major challenge to IE. IE systems usually rely on extraction rules tailored to a particular document collection. If this knowledge is hand-crafted, porting an IE system to new collections will be expensive. Recent research has identified important classes of Internet IE tasks for which highly scalable systems have been developed. In this article, I
• describe these IE tasks and explain how machine learning yields highly scalable IE systems, and
• discuss remaining challenges and argue that scaling up AI applications on the Internet is an important challenge to machine learning.

IE is a key enabling technology for several Internet-taming strategies, such as the information integration systems that present a unified view of heterogeneous Internet sources [1,2]. A movie-information integrator, for example, might provide a single query interface to the movie review, cast list, and schedule information available from dozens of Internet sites. Users pose queries to the interface. The integrator decomposes each query into subqueries against the relevant sources, and then combines the results (see Figure 1). Information integration systems operate by interpreting the Internet sites not as free text, but as structured database-like knowledge sources.
To do so, the system processes site documents to extract the relevant text fragments (movie titles, actor names, and so on), while discarding extraneous material such as HTML tags or advertisements. The integration system uses a library of wrappers, each of which is an IE system customized for a particular Internet site (see Figure 2).

One might expect this Internet IE task to be inherently unscalable because
• source documents are designed for people, and few sites provide machine-readable specifications of their formatting conventions;
• ad hoc formatting conventions used at one site are rarely relevant elsewhere, so a new wrapper must be built for each additional site; and
• sites often change their formatting—a wrapper that …
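The wrapper idea described above can be illustrated with a minimal Python sketch. It is not the article's method: the delimiter-pair design, the `DelimiterWrapper` class, the field names, and the sample page markup are all assumptions made up for illustration. The sketch treats a wrapper as a set of left/right delimiters, one pair per field, and pulls out the text between them while ignoring everything else on the page.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class DelimiterWrapper:
    """Hypothetical site-specific wrapper: one (left, right) delimiter pair per field."""
    delimiters: Dict[str, Tuple[str, str]]

    def extract(self, page: str) -> List[Dict[str, str]]:
        # Capture each field between its delimiters; the lazy ".*?" joins
        # skip whatever markup separates consecutive fields.
        parts = [
            f"{re.escape(left)}(?P<{name}>.*?){re.escape(right)}"
            for name, (left, right) in self.delimiters.items()
        ]
        pattern = re.compile(".*?".join(parts), re.DOTALL)
        return [match.groupdict() for match in pattern.finditer(page)]


# Invented sample page for a hypothetical movie site (not from the article).
page = (
    "<html><body><h1>Now showing</h1>"
    "<li><b>Casablanca</b> starring <i>Humphrey Bogart</i></li>"
    "<li><b>Vertigo</b> starring <i>James Stewart</i></li>"
    "<p>Advertisement</p></body></html>"
)

# The delimiters encode this one site's ad hoc formatting conventions.
wrapper = DelimiterWrapper({"title": ("<b>", "</b>"), "actor": ("<i>", "</i>")})
records = wrapper.extract(page)

for record in records:
    print(record["title"], "|", record["actor"])
```

Because the delimiters are specific to one site's formatting, a second site would need its own delimiter table, and a formatting change on the first site would silently break extraction, which is the scaling concern the list above raises.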
Related resources
A light-weight Web-at-a-Glance system for intelligent information retrieval
Web-at-a-Glance (WAG) is a system to assist the user in information retrieval and discovery by gleaning the most relevant information from a web site or several web sites. This paper presents this new approach for intelligent information retrieval from web sites, and describes the prototyping of the light-weight WAG system as an active index system. © 1998 Elsevier Science B.V. All rights reser...
Easy-DOR : un système de gestion des besoins web au sein d'un groupe d'utilisateurs
Today, more and more users are gleaning information via networks (e.g. Internet and in particular the World Wide Web). To perform this task they use either browsing, searching or filtering tools. These tools are useful for the needs of a single user but are not adapted to a group of users having any information needs. This paper introduces Easy-DOR, a system which enables users to automatically...
Resource Gleaning, From Earlier Times to the Information Age
Inspired by the film documentary The Gleaners and I, the paper defines two senses of gleaning: (1) generally, the collection of items in small quantities, and (2) more specifically, the collection of items missed or rejected during previous harvesting. As an activity, gleaning in both senses is a neglected but essential activity in the solving of problems of lack of resource, especially now in ...
Revisiting adaptations of neotropical katydids (Orthoptera: Tettigoniidae) to gleaning bat predation
All animals have defenses against predators, but assessing the effectiveness of such traits is challenging. Neotropical katydids (Orthoptera: Tettigoniidae) are an abundant, ubiquitous, and diverse group of large insects eaten by a variety of predators, including substrate-gleaning bats. Gleaning bats capture food from surfaces and usually use prey-generated sounds to detect and locate prey. A ...
Behavioural flexibility: the little brown bat, Myotis lucifugus, and the northern long-eared bat, M. septentrionalis, both glean and hawk prey
We present behavioural data demonstrating that the little brown bat, Myotis lucifugus, and the northern long-eared bat, M. septentrionalis, can glean prey from surfaces and take prey on the wing. Our data were collected in a large outdoor flight room mimicking a cluttered environment. We compared and analysed flight behaviours and echolocation calls used by each species of bat when aerial hawki...
Perception of silent and motionless prey on vegetation by echolocation in the gleaning bat Micronycteris microtis
Gleaning insectivorous bats that forage by using echolocation within dense forest vegetation face the sensorial challenge of acoustic masking effects. Active perception of silent and motionless prey in acoustically cluttered environments by echolocation alone has thus been regarded impossible. The gleaning insectivorous bat Micronycteris microtis however, forages in dense understory vegetation ...